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Abstract — The Indus script is one of the major undeciphered scripts of the ancient world. The small size of the corpus, the absence of 
bilingual texts, and the lack of definite knowledge of the underlying language has frustrated efforts at decipherment since the discovery 
of the remains of the Indus civilisation. Recently, some researchers have questioned the premise that the Indus script encodes spoken 
language. Building on previous statistical approaches, we apply the tools of statistical language processing, specifically n-gram Markov 
chains, to analyse the Indus script for syntax. Our main results are that the script has well-defined signs which begin and end texts, 
that there is directionality and strong correlations in the sign order, and that there are groups of signs which appear to have identical 
syntactic function. All these require no a priori suppositions regarding the syntactic or semantic content of the signs, but follow directly 
from the statistical analysis. Using information theoretic measures, we find the information in the script to be intermediate between that 
of a completely random and a completely fixed ordering of signs. Our study reveals that the Indus script is a structured sign system 
showing features of a formal language, but, at present, cannot conclusively establish that it encodes natural language. Our n-gram 
Markov model is useful for predicting signs which are missing or illegible in a corpus of Indus texts. This work forms the basis for the 
development of a stochastic grammar which can be used to explore the syntax of the Indus script in greater detail. 

Index Terms — Formal languages, Markov processes, Sequences 



1 Introduction 

The civilisation that flourished from 7000 BCE to 1500 
BCE in the valley of the river Indus and its surroundings 
was the first urban civilisation in the Indian subconti- 
nent. At its peak, in the period between 2600 BCE and 
1900 BCE, it covered approximately a million square 
kilometers |1], making it the largest urban civilisation of 
the ancient world. The remains of the civilisation were 
first found in Harappa and, following historical con- 
vention, is called the Harappan civilisation. The terms 
Harappan civilisation and Indus civilisation are used 
interchangeably in this paper. 

The Indus people used a system of signs, which has 
mainly survived on seals (see Fig. __ for an example), 
pottery, and other artifacts made of durable materials 
such as stone, terracotta, copper. The sign system is 
yet to be deciphered. The use of these signs is usually 
in short texts, numbering not more than 14 signs in 
a single line of text. Around 400 distinct signs have 
been identified Q, J3|, though some researchers identify 
up to 676 distinct signs |4j. The total texts number to 
about 3000. The lack of decipherment of the sign system 
is attributed to the paucity of the working material, 




Fig. 1. An example of an Indus seal, showing the three 
typical components: the Indus script at the top, a field 
symbol in the middle, and a decorated object at the 
bottom left. Here, since the script is embossed on a seal, 
it is to be read from the left to the right, whereas on 
the sealing, which are impressions of the seal, it is read 
from the right to the left. For the most part, the seals are 
typically between 1 to 2 square inches in size. 
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the absence of any bilingual texts that might aid in 
decipherment, and a lack of definite knowledge of the 
underlying language, assuming the script was used to 
write language. The sign system of the Indus remains 
controversial, with contested claims of decipherment, 
and even the suggestion that it does not encode language 
[5j. Here we shall use the terms sign system and script 
interchangeably to refer to the Indus script. 
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The main methodological difficulty in attempting any 
interpretation of the script is that, due to the paucity of 
information on the context of the writing, one is perforce 
required to make an a priori assumption regarding the 
information the script is meant to convey Thus, the 
range of opinion on what the script encodes varies from 
an Indo- Aryan language [6], a Dravidian language 0, 
a purely numerical system [7], to non-linguistic symbols 
[5]. There is no consensus on any of the above interpre- 
tations. 

A line of research, which does not require such a 
priori assumptions, is the statistical approach. At its heart 
lies the notion of recognising patterns through count- 
ing. While such an approach cannot shed light on the 
semantics of the script, it can reveal important features 
of its syntax. Research using the statistical approach was 
initiated by a Soviet team in 1965, further developed by a 
Finnish team in 1969 (for review of various attempts see 
0, 13, GDI), continued by Siromoney |Q3) in the 1980s 
and followed up more recently by some of us (12), lH3l , 

El 

In this article, we bring to bear the tools of n-gram 
modelling for a thorough statistical analysis of sequences 
in the Indus script. An n-gram is a sequence of n units, 
or tokens. TV-gram models provide a very useful method 
of modelling sequence^ whether they be the sequence 
of letters or words in a natural language, the base pairs 
in the genetic code, or the notes in a musical score. 
This generality is possible because n-gram models are 
indifferent to the semantic content of the units or tokens 
making up the sequence (the words, the letters, the 
base pairs or the notes) but, nonetheless, reveal the 
syntax, if any, that the sequences follow. The n-gram 
approach, then, forms the ideal framework in which the 
Indus script can be analysed without making any a priori 
assumptions. 

Our work in the paper uses a bigram model, that is 
an n-gram model with n = 2, to model the sequence 
of signs in the Indus script. Our analysis reveals that 
the sign system of the Indus is structured, with specific 
text beginner and text ender signs, that there is direc- 
tionality, strong correlations in the order of the signs, 
and groups of signs which appear to have the same 
syntactic function. In Section |2j we provide details of 
the corpus used in this study and present results of an 
empirical statistical analysis of the corpus. The empirical 
analysis shows that a small number of signs account for 
the majority of the usage, a fact we are able to capture 
by fitting the frequency distribution of signs to the well- 
known Zipf-Mandelbrot law. We note in passing, that 
word frequency distributions in natural languages also 
follow this law. In Section [3] we provide a summary of 
the theory of n-gram models, paying special attention to 

1. The examples of English language used in the paper to explain 
various concepts related to n-gram modelling are given for the ease of 
comprehension, as a representative of the general class of sequences. 
These linguistic examples do not imply that any linguistic assumption 
has been made for the Indus script. 



the problem of unseen n-grams. We describe how this 
problem is resolved using smoothing and backoff algo- 
rithms and demonstrate how this modifies the statistics 
obtained from the empirical analysis. Our main results, 
summarised above, are presented in Section H] using a 
bigram model. The Indus script corpora contains many 
instances of texts in which signs are damaged beyond 
recognition or otherwise illegible. The bigram model can 
be used to suggest a set of signs, ranked by probability, 
for the restoration of the damaged or illegible signs. In 
Section we outline the theory behind this method and 
provide examples on known texts from the corpus. In 
the final section, we discuss future work based on the 
n-gram approach and end with a summary of the present 
work. 

2 The corpus and empirical analysis 

Any work on the Indus script, statistical or otherwise, 
must begin with a corpus. Three major corpora of the 
Indus texts, by Mahadevan [2\, Parpola [3] and Wells (U 
are available. We use the electronic concordance of Ma- 
hadevan, henceforth referred to as M77, which records 
417 unique signal in 3573 lines of 2906 texts. In order to 
remove ambiguity, an Extended Basic Unique Data Set 
(EBUDS) is created by removing from the concordance 
all texts containing lost, damaged, illegible, or doubtfully 
read parts |12|. Texts spread across multiple lines are 
also removed. For texts that occur more than once, 
only a single copy is retained. Variations due to the 
archaeological context of the sites, stratigraphy, and type 
of object on which the texts are inscribed are, at present, 
not taken into account in the interests of retaining a 
reasonable sample size. 

The reasons for discarding texts which are spread over 
multiple lines are twofold. First, it is not clear if each line 
of a multi-line text is to be treated as a single text having 
a continuity of sequence, or if it is to be regarded as 
separate text. Second, assuming a continuity of sequence, 
the order in which the texts are to be read across lines 
remains ambiguous (2|. 

The pruned EBUDS dataset contains 1548 texts. In 
EBUDS, 40 signs out of 417 present in the sign list 
of M77 do not make their appearance. Out of these 
removed 40 signs, one sign (374) appears 9 times, one 
sign (237) appears 8 times, two signs (282, 390) appear 3 
times, three signs (324, 376, 378) appear twice and thirty- 
three signs appear only once in M77. Fig. [2] compares 
the distribution of texts lengths in various datasets and 
shows the effect of the removal of texts on the final 
dataset (EBUDS). We have already shown in our earlier 
work [12], that the relative frequency distribution of 
signs in EBUDS is comparable to M77 and hence EBUDS 

2. The serial number of the signs used in this paper is as given by 
Mahadevan in his concordance (M77). As a convention followed in the 
present paper, the texts depicted by pictures are to be read from right 
to left, whereas the texts represented by just strings of sign numbers 
are to be read from left to right (see M77 for discussion on direction 
of texts). 
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Fig. 2. Text length distributions in the different corpora 
used in the analysis. The raw corpus (M77) contains four 
instances of outliers, texts of length n = 2 and n = 3 which 
occur in unusually large numbers. Keeping only single oc- 
currences of these removes the sharp maximum around 
n = 2 in the raw corpus. The corpus free of the outliers 
is then reduced again to keep only unique occurrences 
of the texts. This gives the M77-unique corpus. Finally, 
damaged, illegible and multi-line texts are removed to give 
the EBUDS corpus. Texts of length n = 3 and n = 5 are 
most frequent in this corpus. 

is a true representation of M77, with a reduction in total 
sign occurrences, but not in the percentage of total sign 
occurrences. 

We first present the results of an empirical statisti- 
cal analysis of the EBUDS corpus. Fig. [3] shows the 
frequency distribution of signs in EBUDS. The sign 

corresponding to 342 in M77 ( \J ), is the most frequent 

sign, followed by signs 99 ("), 267('V > ), and 59 (H ). 
The relative frequencies have no significant change in 
the M77 and EBUDS corpora. 

2.1 Zipf-Mandelbrot law 

The same data can be plotted in a different way, as a 
rank-ordered frequency distribution. The most frequent 
sign is given rank r = 1 and its frequency is denoted by 
fx, the next most frequent sign is given rank r = 2 and 
its frequency is denoted as fi and so on, till all signs are 
exhausted. The rank-ordered frequency f r is then plotted 
against the rank r, on double logarithmic axes, as shown 
in Fig. |U Remarkably, the data can be fit very well to 
the Zipf-Mandelbrot law, log/ r = a — b\og(r + c) tl5l . 
The Zipf-Mandelbrot law is commonly used to fit the 
rank-ordered frequency distribution of words in corpora 
of natural languages. Qualitatively, a distribution which 
follows the Zipf-Mandelbrot law has a small number 
of tokens (words, signs) which contribute to the bulk 
of the distribution, but also a large number of rare 
signs which contribute to a long tail. To emphasise this 
point, it is estimated that English has approximately a 




50 100 150 200 250 300 350 400 

Sign No. 



Fig. 3. The frequency distribution of individual signs in the 
EBUDS corpus. The five most common signs are shown 
alongside the frequency bars. The relative frequency dis- 
tribution does not change significantly between EBUDS 
and M77 corpora. 
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Fig. 4. The rank-ordered frequency distribution of signs 
f r plotted against the rank r for the EBUDS corpus. The 
data is fitted well by the Zipf-Mandelbrot law, log/ r = a - 
61og(r+c). For c = and b = 1, this reduces to Zipf's Law, 
f r = a/r. Both these laws are used to fit the frequency 
distribution of words in linguistic corpora. Our fitted values 
are a = 15.39, b = 2.59 and c = 44.47. For English (the 
Brown Corpus), a = 12.43, b = 1.15 and c = 100 fT5] . 

million words, though a college graduate might know 
only between 60,000 to 75,000 of these, and yet be a 
completely competent user of the language [16\. The 
Indus script seems to follow the same pattern, with a 
small number of signs accounting for the majority of 
use, but with a large number of signs which are used 
infrequently. 

2.2 Cumulative distribution 

To further follow up this point, we plot the cumulative 
frequency distribution of the signs in EBUDS in Fig. [5] 
As can be seen from the graph about 69 signs account 
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for 80% of the total data and the most frequent sign 
(342) alone accounts for 10% of the complete EBUDS. 
This observation is consistent with previous analysis by 
Mahadevan for M77 corpus . In the same graph, we 
plot the cumulative distribution of text beginners, that 
is the sign beginning the text, and text enders, that is 
the sign ending the text. Here, an interesting asymmetry 
is evident. Approximately 82 text beginners account for 
80% of the text beginner usage, but only 23 text enders 
are needed to account for the same percentage of text 
ender usage. Since the possible number of text beginners 
and text enders can be any of the 417 signs, the numbers 
above indicate that both text beginners and text enders 
are well-defined, with text enders being more strictly 
defined than text beginners. This fact alone indicates 
that there is directionality in the use of the signs in 
Indus texts. Note that this conclusion of directionality 
is independent of any assumed direction of reading the 
sign sequence. Assuming the texts are read left to right 
would, instead, imply that text beginners are more strictly 
defined than text enders, again implying directionality. 

The analysis above has only been concerned with 
frequency distributions of single signs. We may extend 
the analysis to pairs of signs, triplets of signs and so 
on, as done by some of us earlier [12J, (131 . This allows 
us to explore the very important feature of order and 
correlations between the signs, which are the outward 
manifestations of syntax. Syntax can be recognised even 
when the semantics is doubtful. For example, Chom- 
sky's famous sentence "Colourless green ideas sleep fu- 
riously" becomes syntactically incorrect when the word 
order is changed to give, say, "Green furiously ideas 
colourless sleep". We recognise the first sentence as 
syntactically correct English, but of doubtful semantic 
value, but the second sentence is immediately recognised 
as syntactically invalid, even before we pause to think of 
its semantic content. It is this feature of the sign sequence 
of the Indus script, word order and correlation as an 
indicator of syntax, that we want to explore in detail, 
and which forms the main body of the present work. 
Such features are best explored in the n-gram model to 
be presented in the next section. 

2.3 Probability of sequences 

Before exploring n-gram models, we briefly mention 
what sort of frequency distribution we should expect in 
the absence of correlations and also establish the notation 
we will be using for the remaining part of the paper. 
We denote a sequence of N signs by Sn = s\s 2 . • . sn, 
where each Si is one of the 417 possible Indus signs. 
Each of these Sn is what we have been referring to as 
a text. The EBUDS corpus contains texts of maximum 
length N = 14. By our analysis above, we have obtained 
frequency distributions for the signs s ir by counting the 
number of times c(si) that sign Si occurs in the corpus, 
and then dividing it by the total size of the corpus. This 
is identified with the probability, in the sense of maximum 
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Fig. 5. The cumulative frequency distribution of all signs, 
only text beginners, and only text enders in the EBUDS 
corpus. Approximately 69 signs account for 80% of the 
corpus. The script thus has a large number of signs 
which are used infrequently. The cumulative distributions 
for text beginners and text enders show an asymmetry, 
with only 23 signs accounting for 80% of all text enders, 
while one needs to take 82 signs to account for 80% of 
text beginners. This is clear evidence of the directionality 
in the sign usage. 

likelihood, of seeing the sign Si in a text, 



In the absence of correlations, the joint probability that 
we see sign 52 after sign si is independent of si, and is 
just the product of their individual probabilities 

P(sis 2 ) = P( Sl )P(s 2 ) (2) 

Generalising, the probability of the string Sn = 
si s 2 ... sn is simply a product of the individual prob- 
abilities 

P(S N ) = P(s lS2 ...s N ) = P(s 1 )P(s 2 ) . . . P(s N ) (3) 

In the absence of correlations, then, we have a gener- 
alisation of die throwing, where instead of 6 possible 
outcomes, we have 417 possible outcomes in each throw, 
and the i th outcome has a probability P(si). Each throw 
outputs a single sign, and the outcome of a throw is 
independent of all previous throws. In N throws, we 
generate a line of text Sn- In probability theory, this 
process is called a Bernoulli scheme. In the next section, 
we will show that this is not adequate to model the 
sequence of Indus signs, indicating that there are signif- 
icant correlations. We can then expect that an arbitrary 
permutation of signs in an Indus text will produce a 
syntactically invalid sequence. 

3 n-GRAM MODEL FOR THE INDUS SCRIPT 

The main purpose of an n-gram model is to capture 
the correlations that exist between tokens si,..., s n 
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Fig. 6. The conditional probability P(b\a = 2) from the 
maximum likelihood estimate (above) and from Witten- 
Bell smoothing (below). The maximum likelihood estimate 
assigns zero probabilities to unseen sign pairs and results 
in a non-ergodic Markov chain. The Witten-Bell smoothing 
algorithm reduces the probabilities of the seen sign pairs 
and distributes the reduction over unseen sign pairs. 
This gives an ergodic Markov chain. The square root 
of conditional probabilities are plotted in each case to 
highlight the probabilities of unseen sign pairs. 

in a sequence Sn- Thus, conditional probabilities form 
the core of an n-gram model. Specifically, for a string 
Sn = S1S2 . . . sn the n-gram model is a specification of 
conditional probabilities of the form P(sn\sis 2 . . . s/v-i)/ 
quantifying the probability that the previous N — 1 signs 
of the sub string Sn-i = sis 2 . . . s/v-i is followed by the 
sign s N - Given the n-gram conditional probability, and 
the relation between joint and conditional probabilities 
P(ab) = P(b\a)P(a), the probability of the string Sn can 
be written as, 

P(S N ) = P(s N \si...s N -i)P(si...s N -i) (4) 
= P(s n \Sn-i)P(S N -i) 

Recursively applying P(ab) = P(b\a)P(a) to the leftmost 
terms, we obtain the probability as a product over con- 
ditional probabilities 

N 

^v)=n p (^-i) ( 5 ) 

i=i 

In the above, it is understood that So is a special token 
indicating the start of the string. Note that the above 
expression is an identity that follows from the basic 
rules of probability and contains no approximations. As 
an example, the probability of a string of length three, 
^1^2 S3, is given as a product of trigram, bigram and 
unigram probabilities 

P(sis 2 s 3 ) = P(s 3 \s 1 s 2 )P(s 2 \s 1 )P(s 1 ) (6) 

Clearly, for an n-gram model to be tractable, only a 
finite number of such probabilities can be retained. In 



the simplest bigram model, all correlations beyond the 
preceding sign are discarded, so that 

P(S N \S1 . . . SN-l) = P(s N \sN-l) (7) 

In a trigram model, all correlations beyond two preced- 
ing signs are discarded, so that 

P(s N \si . . . s N -i) = P(s n \sn-iSn-2) (8) 

In a general n-gram model, all correlations beyond the 
(n — 1) preceding signs are discarded. An n-gram model 
can then be thought of as an (n — l) th order Markov 
chain in a state space consisting of the signs Si. The 
introduction of the Markov assumption is the main 
approximation in n-gram models. TV-grams were first 
used by Markov to analyse the probability of a consonant 
following a vowel in Pushkin's Eugene Onegin. Shannon 
applied word n-grams to model sentences in English 
text and with n = 6 obtained sentences which have a 
remarkable syntactic similarity to English. Since then, n- 
gram models have found wide use in many fields where 
sequences are to be analysed, including bioinformatics, 
speech processing and music. 

The theory and applications of n-grams are described 
in several excellent textbooks HHJ, 021 and need not 
be repeated here. We summarise below our method of 
obtaining probabilities from counts, the use of smoothing 
and backoff to account for unseen n-grams, the measures 
we use for evaluating the n-gram model, and finally the 
tests we use to assign a statistical significance to the 
correlations. 

iV-gram probabilities are obtained from counts. For 
single signs, the counts are unambiguous. However, for 
sign pairs, it is possible that a rare combination may not 
be present in a small sized corpus. The count, and the 
resulting probability estimate for that sign pair, then is 
zero. However, in the absence of reasons to the contrary, 
common sense dictates that no sign pair should have 
a strictly zero probability This intuition is quantified 
by various rules which remove probability weight from 
seen sign pairs and distributes it to sign pairs which have 
never been seen, but are nonetheless not impossible. 
Common amongst such ''smoothing'' rules are Laplace's 
add-one rule, a method developed by Turing and Good 
in breaking the Enigma code, and a more recent algo- 
rithm due to Witten and Bell (TH). Here, we use the 
Witten-Bell algorithm to smooth our n-gram models. 
In Fig. we show the estimate of the probability of a 
sign being followed by sign 2, P(b\2) before smoothing 
and after Witten-Bell smoothing. In above panel, the 
only non-zero probabilities are those corresponding to 
signs 12, 14, 162, and 176. These probabilities sum to 
one, indicating that other sign pairs are impossible. The 
Witten-Bell smoothing algorithm restores a finite, but 
small probability to the unseen sign pairs, ensuring again 
that all probabilities sum to one. Apart from being a 
more reasonable way of estimating probabilities from 
counts, it also ensures that the resulting Markov chain is 
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ergodic@. An ergodic Markov chain is essential in such 
applications, since otherwise, probabilities of all strings 
containing unseen n-grams vanish. 

Smoothing is not the only way of correcting for unseen 
n-grams. Another method, called backoff uses probabil- 
ities of (n — l)-grams to estimate n-gram probabilities. 
Thus, the probability of unseen trigrams can be esti- 
mated from that of seen bigrams and unigrams. Here, 
we used the Katz backoff algorithm to estimate bigram, 
trigram and quadrigram probabilities when appropriate. 

The estimation of n-gram probabilities from n-gram 
counts is called learning. A learned n-gram model can 
then be tested to see if it produces n-grams in the corpus. 
To avoid circularity, the corpus is usually divided into a 
training set, from which the probabilities are estimated, 
and a test set, on which the model is evaluated. There are 
several standard measures for evaluating the goodness 
of an n-gram model. Here, we use a standard measure, 
the perplexity, which is related to the information the- 
oretic cross-entropy. We also do a cross-validation test 
using standard procedures. 

Finally, we need tests of association to ascertain the 
significance of sign pairs which appear more or less fre- 
quently than what would be predicted by the Bernoulli 
scheme model. For this, we use a log-likelihood test, 
testing the null hypothesis that there is no association 
between sign pairs. 

In the next section, we provide results obtained from 
an n-gram model of the Indus script. The analysis was 
done using the SRILM El and NSP [20] toolkits. 

4 n-GRAM ANALYSIS OF THE CORPUS 

In this section we present the n-gram probablities ob- 
tained from the EBUDS corpus and draw inferences on 
the structure of the script from an analysis of the n-gram 
probabilities. 

We introduce, following usual practice, two additional 
tokens < s > and < / s > which indicate the beginning and 
end of a text respectively. By convention, the unigram 
probability for the start token is unity, P(<s>) = 1, since 
every text must begin with <s>. The probability of sign 
a being a beginner is then P(<s>a) = (P(a\<s>), since 
P(<s>) = 1. The probability of sign a being an ender 
is P(a</s>). It is also assumed, as usual, that n-gram 
probabilities do not depend on the location of the signs 
in a text, or more formally, that the Markov chain is 
stationary. 

In any n-gram study, a maximum value of n has to 
be chosen in the interest of tractability, beyond which 
correlations are discarded. This can be done in an em- 
pirical fashion, balancing the needs of accuracy and com- 
putational complexity, using measures from informa- 
tion theory which discriminate between n-grams models 
with increasing n lUSI , (T7|, or by more sophisticated 

3. A Markov chain is ergodic if it is possible to reach every state 
from every state (not necessarily in one move). 



methods like the Akaike information criterion which 
directly provides an optimal value for n [21 J. 

In previous work, some of us have shown that bigram 
and trigram frequencies in the EBUDS corpus differ 
significantly from frequencies expected from a Bernoulli 
scheme (12|. The small size of the corpus limits the 
ability to assess significance of quadrigrams and beyond, 
when using the method in [T2|. In our subsequent work 
IH5I it has been shown that 88% of the texts of length 5 
and above can be segmented using frequent unigrams, 
bigrams, trigrams and quadrigrams and complete texts 
of length 2, 3 and 4. Moreover, frequent bigrams or texts 
of length 2 alone account for 52% of the segmented 
corpus. Thus the bulk of the corpus can be segmented 
with n-grams with n not exceeding 4, and almost half 
the corpus can be segmented into bigrams alone. This 
indicates that correlations beyond quadrigrams have 
relatively less significance. 

Here, we use an information theoretic measure, the 
perplexity, which we explain in detail below, to estimate 
an optimal value for n. Summarising the results of 
this analysis, which is presented in Table. HI we find 
that the perplexity monotonically decreases as n ranges 
from 1 to 3 (corresponding to unigram, bigram and 
trigram correlations), but then saturates beyond n = 4 
(corresponding to quadrigram and higher correlations). 
This confirms the observation made in (131 that pen- 
tagram and higher correlations are of relatively less 
significance in the EBUDS corpus. From the differential 
reduction in perplexity with increase in model order, 
it is clear that the most significant correlations are due 
to bigrams, with somewhat modest trigram correlations, 
and almost negligible quadrigram correlations. Our sub- 
sequent analysis is, therefore, wholly based on bigrams. 
The main conclusions that we draw in this paper on the 
structure of the script are expected to remain broadly 
unaltered with the inclusion of trigram and quadrigram 
correlations. We expect a modest improvement in the 
predictive power of the n-gram model when trigram 
and quadrigram correlations are included. The role of 
higher-order correlations will be more fully explored in 
forthcoming work. 

4.1 Inferences from bigram matrix 

We now present the results of a bigram based analy- 
sis of the sequence of signs in the EBUDS corpus. To 
remind ourselves, in the bigram model, it is assumed 
that the probability P(sn\s\ ...sn-i) depends only on 
the immediately preceding sign and is the same as 
P(sn\sn-i)- The bigram model is fully specified by the 
unigram probabilities P(si) and the bigram probabilities 

We compare the bigram probabilities in the absence 
of correlations with the bigram probabilities for EBUDS 
corpus after Witten-Bell smoothing in the two plots of 
Fig. If there is no correlation between b and a, we 
expect P(b\a) — P(b), that is the conditional probability 



STATISTICAL ANALYSIS OF INDUS SCRIPT 



7 



TABLE 1 

Significant sign pairs from the log-likelihood (LL) 
measure of association for bigrams. The 20 most 
frequent sign pairs (first column) are compared with the 
20 most significant sign pairs (third column). The most 

frequent sign pairs are not necessarily the most 
significant sign pairs, as measured by the log-likelihood 
measure of association. 



Sign Pciir 


Rank 


Frequency 


Significant 


Rank 


LL Value 




11 \l Cll V c / 


CEBUDS^I 


Sign Pair 


(LL) 




267, 99 


I 


168 


267 99 


I 


792.40 




2 


75 


<J<JU , O J 


2 




342, 176 


3 


59 


342, 1 


3 


286.46 


8,342 


4 


58 


51,130 


4 


252.02 


391,99 


5 


56 


342, 176 


5 


210.24 


347, 342 


6 


56 


347, 342 


6 


208.68 


342,1 


7 


48 


8,342 


7 


201.77 


293, 123 


8 


40 


293, 123 


8 


196.14 


87,59 


9 


39 


245,245 


9 


195.67 


48, 342 


10 


38 


130, 149 


10 


181.35 


171,59 


11 


36 


171,59 


11 


169.99 


249, 162 


12 


34 


249, 162 


12 


156.07 


89,211 


13 


34 


391,99 


13 


155.63 


245, 245 


14 


33 


222,254 


14 


147.21 


59,211 


15 


31 


182,293 


15 


137.34 


51,130 


16 


27 


150, 123 


16 


132.38 


65,67 


17 


27 


89,211 


17 


130.77 


99,67 


18 


26 


216,254 


18 


128.64 


162, 342 


19 


25 


171,8 


19 


114.87 


123, 343 


20 


25 


87,59 


20 


111.01 



of b is identical to the marginal probability. We show 
this marginal probability in the first plot of Fig. [71 In 
the second plot of Fig. we show the matrix of bigram 
probabilities P(b\a) that sign b follows sign a in the 
corpus. It is clear even by a casual observation, that there 
are significant differences between the two plots. This 
signals the presence of correlations in the script. 

In Fig. [H] we show the text beginner and text ender 
sign probability distributions. The more careful analysis 
using smoothed bigram probabilities confirms our earlier 
conclusion, based on a raw counts, that text enders are 
more strictly defined than text beginners. 

We can further analyse the nature of correlations 
of a sign with other signs preceding or following it 
using the results of bigram analysis. As an example, 
we explore the correlations of the three most frequent 
text beginners (sign numbers 267, 391 and 293) and 
the three most frequent text enders (sign numbers 342, 
176 and 211) shown in Fig. [5] with other signs. It can 
be inferred from the plots of conditional probabilities 
(Fig. i.e. P(6|267), P(6|391) and P(6|293) for the text 
beginners and P(342|a), P(176|a) and P(211\a) for the 
text enders, that the text beginners 267, 391 and 293 are 
more selective in terms of the number of signs which 
can follow them in comparison to the text enders 342, 
176 and 211 which can be preceded by relatively larger 
number of signs. Thus, there is greater flexibility for 
signs preceding the text enders than the signs which tend 
to follow the text beginners. Moreover, if we further go 
into the details of the restricted number signs 99, 123 
which follow the text beginners we find that the nature 



TABLE 2 

The entropy and mutual information of the EBUDS 

corpus. The entropy is smaller than a random 
equiprobable sequence of 417 signs. The mutual 
information is non-zero, indicating the presence of 
correlations between consecutive signs. 



Measure 


Random 


EBUDS 


Entropy (H) 


8.7039 


6.6811 


Mutual information (I) 





2.236 



TABLE 3 

Perplexity and the n-gram cross entropy H n (Q, P) for the 

EBUDS corpus. The perplexity reduces dramatically 
when bigram correlations are included, has a small but 
significant reduction with trigram correlations, but then 
saturates beyond quadrigram correlations. This indicates 
that a bigram model is optimal for capturing the syntax in 
the EBUDS corpus. 



n 


1 


2 


3 


4 


5 


Perplexity (V) 


68.82 


26.69 


26.09 


25.26 


25.26 


H n (Q,P) 


6.10 


4.74 


4.71 


4.66 


4.66 



of the most frequent signs following the text beginners 
is quite similar to the text enders (though in reverse 
direction), in terms of the number of signs which follow 
them as shown in the plots of P(6|99) and P(6|123) 
(Fig. [10]). In other words, the correlation between the 
text beginners and the signs following them is stronger 
than the correlation of these immediate neighbours of 
text beginners with the sign following them, a fact which 
helps us in finding the weaker and stronger junctions in 
the text lH3l . The diagonal elements P(b\b) of the matrix 
of bigram probabilities are the probabilities of sign pairs 
with same signs and the most frequent sign pairs with 
repeating signs in the corpus are (153, 153) and (245, 
245). 



4.2 Test of significance 

To quantitatively assess the significance of the correla- 
tions, and to obtain the statistically significant sign pairs, 
we need to test the null hypothesis that signs a and b 
are independent. In Table [l] we give the most frequent 
sign pairs as well as the ones which are most statistically 
significant. Here, we enumerate the 20 most significant 
sign pairs on the basis of log-likelihood measure of 
association for bigrams. It is interesting to note the most 
significant sign pairs are not always the most frequent 
ones given in the first column of Table Q] for comparison. 
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50 100 150 200 250 300 350 400 

Sign a 



Fig. 7. Bigram probability P(b\a) for a random distribution with no correlations amongst the signs (above) and for the 
EBUDS corpus (below). Horizontal lines in the upper matrix imply that the conditional probability of a sign b following 
a sign a is equal to probability of sign b itself. The bigram probability P(b\a) after Witten-Bell smoothing is shown in the 
lower plot. The difference between the two matrices indicates the presence of correlations in the texts. 
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TABLE 5 

Suggested restoration of signs missing from texts. The last column lists the suggested restorations in decreasing 

order probability (Left to Right). 



Text No. 



Text 



Incomplete Text Most Probable Restoration Probable Restored Sign 



4312 
4016 
5237 
2653 
5073 



9071 



4081 



342 43 53 70 
^342 S 70 67 ^53^7 3^ 
342 135 67 99 267 
342 347 127 43 99 267 
342 244 67 99 130 51 364 



342 53 70 
342 327 70 53 97 391 
342 135 67 99 
342 127 43 99 267 
342 67 99 130 51 364 



16937211 13 194 112 59 

ill. 



342 293 132 72 67 93 99 267 

U)m@ITOiii 

342 237 102 339 39 336 65 121 



342 37 211 13 112 59 

- m 

342 293 132 72 67 93 267 
342 237 339 39 336 65 121 



342 43 53 70 
342 327 70 67 53 97 391 
342 135 67 99 267 
342 347 127 43 99 267 

ximim 

342 3 67 99 130 51 364 



m»» uiit»'!!i4 nm» 



'16937 211 13 194 112 59 

ill. 



r 

342 293 132 72 67 93 99 267 

U)iii®l!TO;ii! 

342 237 102 339 39 336 65 121 



67 99 65 
267 391 

vum 

347 353 9 
3 343 244 



194 
99 402 342 



Ul</s> A 

102 65 



TABLE 6 

Suggested restoration of doubtfully read signs in the texts of M77 corpus. The last column lists the suggested 
restorations in decreasing order probability (Left to Right). The signs with asterisk sign at the top right are the 
doubtfully read signs which are being restored using the bi-gram model. 

Text No. Text Incomplete Text Most Probable Restoration Probable Restored Sign 



8302 



5317 



1 342 67 99 267 1 342 67 99 1 342 67 99 267 267 391 

f 1 Fl 



Kill u 

i47 73 72 65 342 73 72 65 342 347 73 72 65 347 

MM U 



1193 M'U 

342 149 149 



inr® \jw% urr® 

342 3 171 99391 342 3 99391 342 3 171 99391 171 43 67 

2179 *UIIIII°TO""<e> *lflHIII'<2> ^Ulllll^ 1 ^ <¥> 

1 342 96 409 99 267 1 342 96 99 267 1 342 96 409 99 267 409 

3396 mtumiri rAwmm r&wmt® aw 

169 194 39 336 72 65 339 169 194 39 336 72 339 169 194 39 336 72 65 339 65 67 70 

riiiw® iiiiw® n\Mm t 

211 39 336 59 99 391 39 336 59 99 391 211 39 336 59 99 391 211 

2802 iwii** muim uiin/*i** i/i</s> 

245 245109 130 51 176 33 17 245 109 130 51 176 33 17 342 245 109 130 51 176 33 17 342 245 
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0.08 
0.06 
- 0.04 
0.02 




Text Beginners 
















i - 




,) 

293 




. |j..,ulLlil,.j„J.ii.„ .iJiJ iL 


ii.Lj....iL .. ii J 


A i 



200 250 

Sign No. 



0.05 
/\ 0.04 
0.03 

V 

^ 0.02 
0.01 




Text Enders 


u ■ 




342 


i t 

176 21 '. ] 




IJ , III. . , l . .,,1 . ,. , 





150 200 250 

Sign No. 



Fig. 8. Probability P(a\<s>) of a sign a following the start token <s> (text beginners) and probability P(a</s>) of sign 
a preceding the end token </s> (text enders) from bigram matrix P(b\a) with Witten-Bell smoothing. Text beginners 
with a significant probability are more numerous than text enders at the same threshold of probability. 



^0.6- 
so 

£1 0.4 - 
OT 0.2 - 

o - 




150 200 250 

b {Sign No.) 



^ 0.8 

21 0.4 
0.2 




JkJl 




150 200 250 

a (Sign No.) 



Fig. 9. Conditional probability plots for text beginners a = 267, 391, 293 followed by sign b and for texts enders b = 342, 
176,211 preceded by sign a from bigram matrix P(b\a) with Witten-Bell smoothing. Text beginners are more selective 
in terms of the number of signs which can follow them than text enders, which can have a large number of signs 
preceding them. 




1. .11 .iLiiulkuhL ...Li J.JIL 



150 200 250 

b (Sign No.) 



IX, 0.05 



Jll 



LL. 



i. Ill ill. M L, I 



■I ll 



150 200 250 

b (Sign No.) 



Fig. 10. Conditional probability plots for sign b following text beginners a = 99 and a = 123. The number of signs 
following the signs 99 and 123 is greater than the number of signs following text beginners 267, 391 and 293 (Fig. [9]) . 



4.3 Entropy, Mutual Information and Perplexity 

Information theoretic measures can also be used to 
make an estimate of the role of correlations. We use 
the informational theoretic entropy H and the mutual 
information /, 

H = -£)P(a)log 2 P(a), (9) 

a 

ab \ \ / \ / / 

to quantify the amount of correlation in the sequence. 

In Table |2] we compare the entropy and mutual infor- 
mation of the corpus with that of a completely random 
sequence with no correlations. For this, the probability of 
signs is uniform P(a) = 1/417 and the joint probability 
is P(ab) = P(a)P(b). This gives an entropy of — log(417) 



and a vanishing mutual information. In contrast, the 
entropy of the EBUDS corpus is 6.3 and the mutual 
information is 2.3. This indicates the presence of cor- 
relations, but the difference between the entropy and 
mutual information also indicates that there is flexibility 
in the sign usage, and the probability of a sign is not 
completely determined by the preceding sign. 



A related information-theoretic measure, the cross- 
entropy, is useful in discriminating between n-gram 
models with different n. The main goal of n-gram anal- 
ysis is to construct a good model for the probability 
of sequences in a corpus. A metric which quantifies 
how close the estimated probability distribution is to the 
actual probability distribution is the cross-entropy. For a 
true distribution Q(a) and its estimate P(a), the cross 
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TABLE 4 

Examples of texts generated by the bigram model. The 
texts are to be read from the right to the left. Some of the 
texts generated by the model occur in the corpus. 



169 17 28 54 54 99 267 
14 234 303 402 112 99 267 
342 363 104973195939799254391 391 
342 327 67 319 131 100 130 175 150 
15 102 210 336 106 59 216 72 276 



entropy is defined as, 

H(Q,P) = -J2Q(a)log 2 P(a) (11) 

a 

The cross-entropy is minimum when the true and esti- 
mated probability distributions coincide, Q(a) = P(a). 
The perplexity V, which is the measure commonly used 
in natural language processing literature, is an exponen- 
tial of the cross-entropy, 

<p = 2 h{Q,p) (12) 

Of course, the true probability distribution Q(a) is not 
known, but it can be shown |lB|, (F7\ that for a corpus of 
size M obtained from a stationary and ergodic Markov 
chain, the cross-entropy is given by the limit, 

H(Q,P)= lim i--^log 2 P(a) (13) 

a 

In other words, the cross-entropy of the corpus, as 
the corpus size approaches infinity becomes the cross- 
entropy of the probability distribution Q(a) from which 
the corpus was generated. The previous formula does 
not require knowledge of Q(a), and can then be used 
to give an estimate of the cross-entropy for a large, but 
finite, corpus. The relation has obvious generalisations 
to joint probability distributions P(ab), P(abc) . . ., of 
bigrams, trigrams and higher n-grams. We denote the 
n-gram cross-entropy by H n (Q,P). 

Here, we measure the cross-entropy against a held out 
corpus. We see that the perplexity is reduced consider- 
ably when bigram correlations are taken into account. 
We have also evaluated the perplexity for trigram and 
quadrigram models, and this shows a monotonic reduc- 
tion in perplexity as shown in Table [3] This implies that 
correlations beyond the preceding sign are important, 
though the most important correlations comes from the 
preceeding sign. The perplexity of the bigram model is 



26.69 which is significantly lower than that of unigram 
model which equals 68.82. As discussed in the beginning 
of this section, this motivates our choice of retaining only 
bigram correlations for the present study. 

The bigram model can now be used to generate texts 
according to the Markov chain defined by the unigram 
and bigram probabilities. In Table H]we show examples 
of texts which have been generated by the bigram model. 
The performance of the model is discussed in Section [6] 

Our results in this section establish without doubt 
that there are significant correlations between signs in 
the Indus script, and reveal the presence of what can 
reasonably be called syntax. There are well-defined text 
beginners and text enders, sign order is important, and 
many statistically significant pairs of signs occur in the 
corpus. 

5 Restoring illegible signs 

An important practical use of the bigram model is to 
restore signs which are not legible in the corpus due to 
damage or other reasons. The basic idea is to use the 
bigram model to evaluate the probability of a suggested 
restoration, and chose the restoration with the highest 
probability. For example, consider the three sign text 
£3 = sis x ss in which the middle sign s x is illegible. We 
use the bigram model to evaluate the probability of the 
string for different choices of s x by 

P(S 3 ) = P(</s>|5 3 )P(5 3 |5 aj )P(a aj |5i)P(5i|<S>) (14) 

The most probable sign with which to restore the text is, 
then, the maximum of this probability over all possible 
signs s x . Since there are 417 possible signs, this can 
be accomplished by a direct enumeration. When the 
number of illegible signs is more, the space over which 
the maximisation needs to be done grows rapidly. With p 
illegible signs, there are 417 p values from which to pick a 
maximum. In such instances, instead of a direct search, a 
dynamic programming algorithm may be applied. Here, 
we use the well-known Viterbi algorithm IH51 for finding 
the most probable state sequence for a given observation 
in a hidden Markov model, suitably modified to make 
it applicable to a Markov chain, to find most probable 
sequence of signs. Our results for sign restorations are 
summarised in Table |5] We list the original text, a ran- 
domly chosen deletion for that text, the most probable 
restoration, and the next probable restorations obtained 
using the bigram model. We see that in all cases, the 
bigram model is successful in reproducing the deleted 
sign. This gives us confidence that the bigram model 
can be used to suggest restorations of illegible signs in 
various corpora. Table [6] gives few examples of how 
the model can be used for restoration of doubtfully read 
signs in the texts of M77 corpus. 

We can also use the model to find most probable texts 
of various lengths. In Table [H] we reproduce the most 
probable texts of length 4, 5 and 6 as predicted by the 
bigram model. It is quite remarkable that there are exact 
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70 71 72 73 74 75 76 77 78 79 




70 71 72 73 74 75 76 77 78 79 



70 71 72 73 74 75 76 77 78 79 




70 72 74 76 78 

Sensitivity (in%) 



positives, that is, the number of times the predicted signs 
matches with the signs under 90% area of the probability 
curve; otherwise, they are considered false negatives. 
This is repeated about 100 times with the first test set and 
training set and we plot the histogram of the sensitivity 
for these 100 runs. 

The cross validation test described above is repeated 
by taking each of the five equal parts of EBUDS as test set 
and concatenating remaining parts as training sets. The 
results are shown in Table [71 and Fig. [TTJ As can be seen 
from the plots the sensitivity of the model considering 
all signs under 90% area of the cumulative probability 
curve as true positives is 74%. 



Fig. 11. Sensitivity of the bigram model taking all signs 
under 90% area of the cumulative probability curve as 
true positives is 74%. The five plots are for five different 
sets of test and training sets of EBUDS as given in Tabled 

TABLE 7 

Mean Sensitivity (in %) with Standard Deviation of the 
model predicted from each of the five test sets PI, P2, 

P3, P4 and P5 



Test Set 


Train Set 


Mean Sensitivity 


STDEV 


PI 


{P2, P3, PA, P5} 


75 


2 


P2 


{PI, P3, P4, P5} 


74 


2 


P3 


{PI, P2, P4, P5} 


74 


2 


P4 


{PI, P2, P3, P5} 


75 


2 


P5 


{Pl, P2, P3, P4} 


72 


2 




Average 


74 


2 



instances of the most probable texts of length 4,5 and 
slight variant of most probable text of length 6 in the 
corpus. 

6 Model performance evaluation: 
Cross Validation 

We have already discussed perplexity in section |U Here 
we discuss another method to evaluate the performance 
of the model called cross validation. In cross validation, 
the corpus is divided into a training set, from which 
the probabilities are estimated, and a test set, on which 
the model is evaluated. The measure of goodness of the 
model calculated is sensitivity which is defined as 

Sensitivity = TP/ (TP + FN) (15) 

where TP is the count of true positives and FN is the 
count of false negatives. The ratio of training to test 
set size used is 80 : 20. We divide EBUDS into 5 equal 
parts i.e. PI, P2, P3, P4 and P5 each being one-fifth 
of the corpus. We start with first part, selecting that to 
be the test set and concatenate the remaining parts to 
form the first training set. We use the training set to 
learn the parameters of the model and use the test set to 
evaluate the goodness of the model. We drop out signs 
randomly from the test set and ask the model to fill in 
the dropped signs. We then count the number of true 



7 Summary and conclusions 

We conclude that a bigram model, equivalent to a first- 
order Markov chain, can be used to capture the corre- 
lations of signs in the Indus texts and gives important 
insights into the structure of Indus writing. Our study 
has revealed that the script carries syntax, with well- 
defined text beginners and text enders, directionality and 
sign order, and strong correlations between signs. Infor- 
mation theoretic measures indicate that the correlations 
are not completely rigid. This indicates that the script 
can certainly be considered as a formal language, but it 
remains to be seen if these features imply an underlying 
natural language. The tests from which we draw this 
conclusion are summarised in Table [9j The bigram model 
can be used to suggest probable restorations of illegible 
signs from a corpus. 

To the best of our knowledge, our work represents 
the first use of the methods of n-gram analysis to an 
undeciphered scipt. We believe stochastic methods hold 
considerable promise in elucidating syntax in undeci- 
phered scripts. We will present further results on the 
syntax of Indus script, based on stochastic methods, in 
our forthcoming work. 
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TABLE 8 

The most probable texts of length 4, 5 and 6 predicted by the model. Remarkably, exact instances of the predicted 
texts are present in the corpus for the 4-sign and 5-sign texts. For the 6-sign text, the same sequence, but with 2 

insertions, is found in the corpus. 



Text length Text with blank positions 


Most probable predicted text 


Text No. 


Closest matching text from M77 


4 </s> §Mf/M<s> 




1232 







342 43 99 267 




342 43 99 267 






2580 










342 43 99 267 


5 </ s > WS/S/SM <s> 


mv<> 


2476 







342 3 171 99 267 




342 3 171 99 267 


6 </ s > <s> 


imoc M <s> 


1322 







342 3 171 53 99 267 




342 3 171 53 230 175 99 267 



TABLE 9 
Major conclusions 



SI. No 


Test /Measure 


Results 


Fig. / Table No. 


Conclusions 


I. 


Zipf-Mandelbrot law 


Best fit for a= 15.4, b = 2.6, 

c = 44.5 (95% Confidence interval) 


Fig. a 


Small number of signs account for 
bulk of the data while a large number 
of signs contribute to a long tail. 


2. 


Cumulative frequency 
distribution 


69 signs: 80 % of EBUDS 

23 signs: 80 % of Text Enders 

82 signs: 80 % of Text Beginners 


Fig m 


Indicates asymmetry in usage of 417 
distinct signs. Suggests directionality 
and structure in writing. 


3. 


Bigram probability 


Conditional probability matrix for 
EBUDS is strikingly different from 
the matrix assuming no correlations. 


Fig. El 


Indicates presence of significant 
correlations between signs. 


4. 


Conditional probabilities 
of text beginners and 
text enders 


Restricted number of signs follow 
frequent text beginners whereas 
large number of signs precede 
frequent text enders. 


Fig. m 


Indicates presence of signs having 
similar syntactic functions. 


5. 


Log-likelihood 
significance test 


Significant sign pairs extracted. 


Tabled] 


The most significant sign pairs are not 
always the most frequent ones. 


6. 


Entropy 


Random: 8.73; EBUDS: 6.68 


Table H 


Indicates presence of correlations. 


7. 


Mutual information 


Random: 0; EBUDS: 2.24 


Table 


Indicates flexibility in sign usage. 


8. 


Perplexity 


Monotonic reduction as n 
increases from 1 to 5. 


Table |3] 


Indicates presence of long range 
correlations, see also fl2l, fl3l. 


9. 


Sign Restoration 


Restoration of missing and 
illegible signs. 


Tables 


Model can restore illegible signs, 
according to probability. 


10. 


Cross validation 


Sensitivity of the model = 74 % 


Table Fig. El 


Model can predict signs with 
74% accuracy. 
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